7 June 2017

S0: Prior to lecture

Preparing for this lecture

install.packages("tidyverse") ## Might be a while...
install.packages("janitor")
install.packages("plotly")

S1: Necessary evils of Applied Statistics

Good statistical discoveries don't just fall from the sky

  • Statisticians are great at many things:

    1. Understanding data characteristics
    2. Building statistical/mathematical models
    3. Repeat 1 and 2…like…a lot…
    4. Extract insights
  • But the mother of them all, preparing the data, is far from trivial. (e.g. STAT2xxx lab exams)

Let \(\boldsymbol{X}\) be the thing I want…

  • The real problem is not applying fancy shampoo to your cat. It is getting your cat into the bathtub.
  • Be alarmed!

Hidden side of being a statistician

  • Assume we have data
  • Assume we have cleaned data
  • Assume we can reveal aspects of the data
  • Assume we can perform statistics on the data
  • Assume we interrogated the right aspects of the data
  • Assume we did everything right, communicate insights to others

Summary of this lecture

  • Powerful tools to organise your data so you can focus on the statistics.

  • Passive learning is not going to work.

  • S1: Introduction
  • S2: Reading in data using readr and readxl
  • S3: Basic data cleaning using janitor
  • S4: Clean coding using magrittr
  • S5: Data filtering using dplyr
  • S6: Data visualisation using ggplot2
  • S7: A glimpse into the future of R: tidyverse
  • S8: Conclusion

S2: Reading data

Better read/write data

  • base R functions haven't adapted to the needs of modern users.

  • readr functions are superior in data import warnings, column type handling, speed, scalability and consistency.

library(readr)

Reading data using readr

dirtyIris = read_csv("dirtyIris.csv")
class(dirtyIris)
[1] "tbl_df"     "tbl"        "data.frame"
dirtyIris
# A tibble: 250 x 6
  SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
            <dbl>              <dbl>              <dbl>             <dbl>
1      5.80000000                2.6                4.0               1.2
2      5.80000000                2.7                5.1               1.9
3      0.01874617                 NA                 NA                NA
4              NA                 NA                 NA                NA
5              NA                 NA                 NA                NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
#   allEmpty <chr>
  • tibble is a more advanced version of data.frame.

  • readxl and haven (for SAS, SPSS etc.) packages work similarly.

  • We now proceed to data cleaning on the dirtyIris dataset.
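For the other formats, the calls look almost identical. A minimal sketch, with hypothetical file names (substitute your own):

```r
library(readxl)   ## Excel files
library(haven)    ## SAS, SPSS, Stata files

## Hypothetical file names, for illustration only
xlsxIris = read_excel("dirtyIris.xlsx", sheet = 1)
spssIris = read_sav("dirtyIris.sav")
sasIris  = read_sas("dirtyIris.sas7bdat")
```

Each of these returns a tibble, just like read_csv.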

S3: Cleaned data

What is clean data?

Clean data is a data set that allows you to do statistical modelling without extra processing:

  1. Good documentation on the entire data.
  2. Each column is a variable. The name should be intuitive, and:
    • No bad characters @KevinWang009
    • No inconsistent capitalisation (first name vs First Name)
    • No inconsistent separators (cricket_australia vs cricket.australia)
  3. Each row is an observation:
    • No bad characters
    • No poorly designed row names (1, 2, 5, … )
    • No repeated row names (a, a.1, b, … )
  4. Each column should be imported according to its most appropriate type.
    • e.g. Months should be read in as character or Date, not as a factor ordered alphabetically (R's default).
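To see point 4 in action, here is a minimal sketch of R's default factor behaviour with month names:

```r
months = c("Jan", "Feb", "Mar", "Apr")

## Default: factor levels are sorted alphabetically -- rarely what you want
levels(factor(months))
## [1] "Apr" "Feb" "Jan" "Mar"

## Better: keep as character, or set the levels explicitly
levels(factor(months, levels = months))
## [1] "Jan" "Feb" "Mar" "Apr"
```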

Data cleaning in base R

  • Clean data is a well-designed data.frame.

  • base R functions might struggle with the issues we mentioned.

  • Basic data cleaning: the janitor package.

  • More advanced data manipulation: the dplyr package.

janitor: basic data cleaning

  • Clean up the bad column names
library(janitor)
dirtyIris
# A tibble: 250 x 6
  SepAl....LeNgth `Sepal.?    Width` `petal.Length(*&^` `petal.$#^&Width`
            <dbl>              <dbl>              <dbl>             <dbl>
1      5.80000000                2.6                4.0               1.2
2      5.80000000                2.7                5.1               1.9
3      0.01874617                 NA                 NA                NA
4              NA                 NA                 NA                NA
5              NA                 NA                 NA                NA
# ... with 245 more rows, and 2 more variables: `SPECIES^` <chr>,
#   allEmpty <chr>
## Clean up column names
better = clean_names(dirtyIris) 
better
# A tibble: 250 x 6
  sepal_length sepal_width petal_length petal_width    species allempty
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>    <chr>
1   5.80000000         2.6          4.0         1.2 versicolor     <NA>
2   5.80000000         2.7          5.1         1.9  virginica     <NA>
3   0.01874617          NA           NA          NA       <NA>     <NA>
4           NA          NA           NA          NA       <NA>     <NA>
5           NA          NA           NA          NA       <NA>     <NA>
# ... with 245 more rows
## Removing empty rows/columns
evenBetter = remove_empty_cols(better)
evenBetter = remove_empty_rows(evenBetter)

evenBetter
# A tibble: 209 x 5
  sepal_length sepal_width petal_length petal_width    species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1   5.80000000         2.6          4.0         1.2 versicolor
2   5.80000000         2.7          5.1         1.9  virginica
3   0.01874617          NA           NA          NA       <NA>
4   6.50000000         3.0          5.5         1.8  virginica
5   6.30000000         2.9          5.6         1.8  virginica
# ... with 204 more rows
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
## Similar to na.omit(), any rows with NA will be removed. 
evenBetterBetter = remove_missing(evenBetter) 

glimpse(evenBetterBetter)
Observations: 150
Variables: 5
$ sepal_length <dbl> 5.8, 5.8, 6.5, 6.3, 5.1, 6.2, 5.9, 6.8, 5.0, 6.7,...
$ sepal_width  <dbl> 2.6, 2.7, 3.0, 2.9, 3.8, 2.8, 3.0, 3.2, 3.0, 3.0,...
$ petal_length <dbl> 4.0, 5.1, 5.5, 5.6, 1.9, 4.8, 4.2, 5.9, 1.6, 5.0,...
$ petal_width  <dbl> 1.2, 1.9, 1.8, 1.8, 0.4, 1.8, 1.5, 2.3, 0.2, 1.7,...
$ species      <chr> "versicolor", "virginica", "virginica", "virginic...
cleanIris = evenBetterBetter

This is the original iris data:

glimpse(iris)
Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1,...
$ Species      <fctr> setosa, setosa, setosa, setosa, setosa, setosa, ...

S4: Clean coding (Skim through)

Coding complexity increases with the number of brackets

  • The "inside out" structure of coding isn't great for human reading.
mean(cleanIris$sepal_length)
[1] 5.843333
plot(density(cleanIris$sepal_length), col = "red", lwd = 2)

Piping allows you to read code from left to right

  • We introduce a new notation: "x %>% f" means "f(x)". We call this operation "x pipe f".

  • Code reads from left to right, and compound operations are possible.

  • The RStudio keyboard shortcut is Cmd+Shift+M.

cleanIris$sepal_length %>% mean
[1] 5.843333
cleanIris$sepal_length %>%
  density %>%
  plot(col = "red", lwd = 2)

S5: dplyr: data subsetting master

Traditional way of subsetting data in R (optional)

  • If I want the first two rows of column sepal_length and sepal_width in the cleanIris data:
## Assuming you know the position of column names.
## But what if you resample your data?
cleanIris[1:2, c(1, 2)]

## Assuming you know the position of column names.
## Also assuming the first two columns satisfy certain properties.
cleanIris[1:2, c(T, T, F, F, F)]

## Much better!
## What if you can't type out all the column names
## due to the size of your data?
cleanIris[1:2, c("sepal_length", "sepal_width")]
  • Something more realistic:
cleanIris[(cleanIris[,"sepal_length"] < 5) &
            (cleanIris[,"sepal_width"] < 3), c("petal_length", "sepal_length")]
# A tibble: 4 x 2
  petal_length sepal_length
         <dbl>        <dbl>
1          1.3          4.5
2          1.4          4.4
3          3.3          4.9
4          4.5          4.9
  • (Optional) A pro R user might know about the subset function, but it suffers from the same problem of not being able to handle multiple subsetting criteria without predefined variables.
subset(cleanIris,
       subset = (sepal_length < 5) & (sepal_width < 3),
       select = grep("length", colnames(cleanIris), value = TRUE))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

Subsetting data using dplyr

  • Think of subsetting rows and columns as two completely different procedures:
  • select chooses columns: operations on variables.
  • filter chooses rows: operations on observations.

  • See cheatsheet.

library(dplyr)

cleanIris %>%
  filter(sepal_length < 5, sepal_width < 3) %>%
  select(contains("length"))
# A tibble: 4 x 2
  sepal_length petal_length
         <dbl>        <dbl>
1          4.5          1.3
2          4.4          1.4
3          4.9          3.3
4          4.9          4.5

dplyr is much more!

  • mutate creates new columns
iris_mutated = mutate(cleanIris,
      newVar = sepal_length + petal_length,
      newNewVar = newVar*2) %>%
  rename(lengthSum = newVar,
         lengthSumSq = newNewVar)

iris_mutated
# A tibble: 150 x 7
  sepal_length sepal_width petal_length petal_width    species lengthSum
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>     <dbl>
1          5.8         2.6          4.0         1.2 versicolor       9.8
2          5.8         2.7          5.1         1.9  virginica      10.9
3          6.5         3.0          5.5         1.8  virginica      12.0
4          6.3         2.9          5.6         1.8  virginica      11.9
5          5.1         3.8          1.9         0.4     setosa       7.0
# ... with 145 more rows, and 1 more variables: lengthSumSq <dbl>
  • group_by + summarise creates summary statistics within each group
bySpecies = cleanIris %>%
  group_by(species)

bySpecies
# A tibble: 150 x 5
# Groups:   species [3]
  sepal_length sepal_width petal_length petal_width    species
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>
1          5.8         2.6          4.0         1.2 versicolor
2          5.8         2.7          5.1         1.9  virginica
3          6.5         3.0          5.5         1.8  virginica
4          6.3         2.9          5.6         1.8  virginica
5          5.1         3.8          1.9         0.4     setosa
# ... with 145 more rows
bySpecies %>%
  summarise(meanSepalLength = mean(sepal_length))
# A tibble: 3 x 2
     species meanSepalLength
       <chr>           <dbl>
1     setosa           5.006
2 versicolor           5.936
3  virginica           6.588
  • Scoped variants: summarise_if applies a summary to every column satisfying a predicate
bySpecies %>%
  summarise_if(is.numeric,
               funs(m = mean))
# A tibble: 3 x 5
     species sepal_length_m sepal_width_m petal_length_m petal_width_m
       <chr>          <dbl>         <dbl>          <dbl>         <dbl>
1     setosa          5.006         3.428          1.462         0.246
2 versicolor          5.936         2.770          4.260         1.326
3  virginica          6.588         2.974          5.552         2.026
  • left_join for merging data
flowers = data.frame(species = c("setosa", "versicolor", "virginica"),
                     comments = c("meh", "kinda_okay", "love_it!"))

## cleanIris has the priority in this join operation
iris_comments = left_join(cleanIris, flowers, by = "species")

## Randomly sampling 6 rows 
sample_n(iris_comments, 6) 
# A tibble: 6 x 6
  sepal_length sepal_width petal_length petal_width    species   comments
         <dbl>       <dbl>        <dbl>       <dbl>      <chr>     <fctr>
1          6.0         3.4          4.5         1.6 versicolor kinda_okay
2          5.6         3.0          4.5         1.5 versicolor kinda_okay
3          5.8         2.7          5.1         1.9  virginica   love_it!
4          4.8         3.0          1.4         0.1     setosa        meh
5          5.1         3.5          1.4         0.2     setosa        meh
6          5.7         2.9          4.2         1.3 versicolor kinda_okay

Checking if we cleaned the data properly (optional)

  • arrange for ordering rows
arrangeCleanIris = cleanIris %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)

## The original data
arrangeIris = iris %>% 
  clean_names() %>% 
  arrange(sepal_length, sepal_width, petal_length, petal_width)
  • We sorted both the processed dirtyIris data and the original iris data. Are they the same, as claimed?
## The `Species` column is character or factor
all.equal(arrangeCleanIris, arrangeIris) 
[1] "Incompatible type for column `species`: x character, y factor"
arrangeIris = arrangeIris %>% 
  mutate(species = as.character(species)) 

## Great! 
all.equal(arrangeCleanIris, arrangeIris)
[1] TRUE
## identical is a stronger version of all.equal
## arrangeCleanIris is a tibble
## but arrangeIris is a data.frame
identical(arrangeCleanIris, arrangeIris) 
[1] FALSE
class(arrangeCleanIris)
[1] "tbl_df"     "tbl"        "data.frame"
class(arrangeIris)
[1] "data.frame"
## Now, they are exactly the same!
identical(arrangeCleanIris, arrangeIris %>% as.tibble)
[1] TRUE

S6: ggplot2: best visualisation package EVER

ggplot2 is for systematic plots

  • See tutorial sheet. (https://gauss17gon.shinyapps.io/ggplot2_basic_tutorial/)

  • Di Cook: the real reason you should use ggplot2 is that its design forces you to use a consistent grammar when producing a plot. So much so that every plot you produce is actually a statistic.

  • Every characteristic (e.g. axes, colour, size) of your graph is a column in a data.frame, i.e. you can't plot vector x on line 1 and vector newx on line 100.

library(ggplot2)
p1 =
  iris %>%
  ggplot(aes(x = Sepal.Length,
             y = Sepal.Width)) +
  geom_point(size = 3)

p1

  • Each feature on the plot can be manipulated directly.
p2 =
  iris %>%
  ggplot(aes(x = Sepal.Length,
             y = Sepal.Width,
             colour = Petal.Length,
             shape = Species)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "Spectral") +
  theme_bw() +
  theme(legend.position = "bottom")

p2

Reshaping data to suit visualisation

  • Is the data in the best possible form to be visualised?
  • Suppose we wish to visualise each of the 4 variables (petal/sepal width/length) across the 3 levels of Species in iris. i.e. The x-axis should have a \(4 \times 3\) grouping.
  • The current \(150 \times 5\) data.frame is bad for this purpose.
  • To achieve this we use tidyr::gather to create two new columns in place of the 4 numerical columns:
    1. variable, which indicates (petal/sepal-width/length)
    2. value, the corresponding value for that variable for the Species
## Initial plotting to decide which variable should be matched to which variable
gatherData = iris %>%
  gather(key = variable,
         value = value,
         -Species) ## This is the column we don't want to throw away. 

dim(gatherData)
[1] 600   3
head(gatherData)
  Species     variable value
1  setosa Sepal.Length   5.1
2  setosa Sepal.Length   4.9
3  setosa Sepal.Length   4.7
4  setosa Sepal.Length   4.6
5  setosa Sepal.Length   5.0
6  setosa Sepal.Length   5.4
gatherData %>%
  ggplot(aes(x = variable,
             y = value,
             fill = Species)) +
  geom_boxplot() + 
  theme_minimal()

S7: tidyverse (optional)

tidyverse: tidy data + tidy coding

  • tidyverse is a collection of 20 packages (and still growing!) built on the philosophy that both your data and your coding should be tidy.

  • These functions:
    • They integrate well with each other.
    • They are NOT ad-hoc programming solutions.
    • They were built with a philosophy of being "tidy" and "reproducible" in mind.
    • They will throw errors at you if you don't have a thorough understanding of your data.
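Loading the whole collection is one call. A sketch of what to expect (core package list as of tidyverse 1.1.1):

```r
library(tidyverse)
## Attaches the core packages in one go:
## ggplot2, tibble, tidyr, readr, purrr, dplyr
## and warns about masked base R functions,
## e.g. dplyr::filter() masks stats::filter()
```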

A gallery (optional)

S8: Conclusion

Advice in the future

  • What I discussed today is really about being "tidy" and having a reproducible pipeline. The same principle applies to any project you do.

  • Use RStudio + RMarkdown to document your code. Controversial: plain R isn't flexible enough to handle large projects and their documentation.

  • You don't have to adopt everything I taught you today! Find the components you like and fold them into your work routine. (Hint: start with cheatsheets and build up gradually.)

  • Take time to re-analyse an old dataset.

  • The design/implementation philosophy is not for everyone. e.g. tibble does not have row names, which makes sense if your data is large but less so if it is small.
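A minimal sketch of the row-names point, using mtcars (whose row names carry real information):

```r
library(tibble)

head(rownames(mtcars), 3)
## [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"

## as_tibble() drops row names,
## so move them into a proper column first
mtcarsTbl = as_tibble(rownames_to_column(mtcars, var = "car"))
```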

  • Learn the core functions of each package first! Every modern package has a tutorial or vignette page.

  • Don't forget the theories and interpretations! This is a course about statistics after all, not Cranking-Out-Numbers-Less-Than-0.05-And-Reject-Null-Hypothesis-101.

Session Info and References

sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.3

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2    janitor_0.3.0   dplyr_0.7.2     purrr_0.2.2.2  
 [5] readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1  
 [9] tidyverse_1.1.1 knitr_1.16     

loaded via a namespace (and not attached):
 [1] reshape2_1.4.2     haven_1.1.0        lattice_0.20-35   
 [4] colorspace_1.3-2   htmltools_0.3.6    viridisLite_0.2.0 
 [7] yaml_2.1.14        plotly_4.7.0       rlang_0.1.1       
[10] foreign_0.8-69     glue_1.1.1         RColorBrewer_1.1-2
[13] modelr_0.1.0       readxl_1.0.0       bindr_0.1         
[16] plyr_1.8.4         stringr_1.2.0      munsell_0.4.3     
[19] gtable_0.2.0       cellranger_1.1.0   rvest_0.3.2       
[22] htmlwidgets_0.9    psych_1.7.5        evaluate_0.10.1   
[25] labeling_0.3       forcats_0.2.0      httpuv_1.3.5      
[28] crosstalk_1.0.0    parallel_3.4.1     broom_0.4.2       
[31] Rcpp_0.12.12       xtable_1.8-2       scales_0.4.1.9002 
[34] backports_1.1.0    jsonlite_1.5       mime_0.5          
[37] mnormt_1.5-5       hms_0.3            digest_0.6.12     
[40] stringi_1.1.5      shiny_1.0.3        grid_3.4.1        
[43] rprojroot_1.2      tools_3.4.1        magrittr_1.5      
[46] lazyeval_0.2.0     pkgconfig_2.0.1    data.table_1.10.4 
[49] xml2_1.1.1         lubridate_1.6.0    assertthat_0.2.0  
[52] rmarkdown_1.6      httr_1.2.1         R6_2.2.2          
[55] nlme_3.1-131       compiler_3.4.1